Abstract

We describe an annotation scheme and a 
tool developed for creating linguistically 
annotated corpora for non-configurational 
languages. Since the requirements for such 
a formalism differ from those posited for 
configurational languages, several features 
have been added, influencing the architec- 
ture of the scheme. The resulting scheme 
reflects a stratificational notion of lan- 
guage, and makes only minimal assump- 
tions about the interrelation of the partic- 
ular representational strata. 

1 Introduction 

The work reported in this paper aims at provid- 
ing syntactically annotated corpora ('treebanks') for 
stochastic grammar induction. In particular, we fo- 
cus on several methodological issues concerning the 
annotation of non-configurational languages. 

In section 2, we examine the appropriateness of
existing annotation schemes. On the basis of these
considerations, we formulate several additional
requirements. A formalism complying with these
requirements is described in section 3. Section 4 deals
with the treatment of selected phenomena. For a
description of the annotation tool see section 5.

2 Motivation 

2.1 Linguistically Interpreted Corpora 

Combining raw language data with linguistic infor- 
mation offers a promising basis for the development 
of new efficient and robust NLP methods. Real- 
world texts annotated with different strata of lin- 
guistic information can be used for grammar induc- 
tion. The data-drivenness of this approach presents 
a clear advantage over the traditional, idealised no- 
tion of competence grammar. 

2.2 Existing Treebank Formats 

Corpora annotated with syntactic structures are 
commonly referred to as treebanks. Existing treebank
annotation schemes exhibit a fairly uniform
architecture, as they all have to meet the same basic 
requirements, namely: 

Descriptivity: Grammatical phenomena are to be 
described rather than explained. 

Theory-independence: Annotations should not
be influenced by theory-specific considerations.
Nevertheless, different theory-specific representations
shall be recoverable from the annotation,
cf. (Marcus et al., 1994).

Multi-stratal representation: Clear separation 
of different description levels is desirable. 

Data-drivenness: The scheme must provide representational
means for all phenomena occurring
in texts. Disambiguation is based on human
processing skills (cf. (Marcus et al., 1994),
(Sampson, 1995), (Black et al., 1996)).

The typical treebank architecture is as follows: 

Structures: A context-free backbone is augmented 
with trace-filler representations of non-local de- 
pendencies. The underlying argument structure 
is not represented directly, but can be recovered 
from the tree and trace-filler annotations. 

Syntactic category is encoded in node labels. 

Grammatical functions constitute a complex label
system (cf. (Bies et al., 1995), (Sampson,
1995)).

Part-of-Speech is annotated at word level. 

Thus the context-free constituent backbone plays 
a pivotal role in the annotation scheme. Due to 
the substantial differences between existing models 
of constituent structure, the question arises of how 
the theory independence requirement can be satis- 
fied. At this point the importance of the underlying 
argument structure is emphasised (cf. (Lehmann et 
al., 1996), (Marcus et al., 1994), (Sampson, 1995)). 

2.3 Language-Specific Features 

Treebanks of the format described in the above section
have been designed for English. Therefore, the
solutions they offer are not always optimal for other
language types. As for free word order languages, 
the following features may cause problems: 

• local and non-local dependencies form a contin- 
uum rather than clear-cut classes of phenom- 
ena; 

• there exists a rich inventory of discontinuous 
constituency types (topicalisation, scrambling, 
clause union, pied piping, extraposition, split 
NPs and PPs); 

• word order variation is sensitive to many fac- 
tors, e.g. category, syntactic function, focus; 

• the grammaticality of different word permuta- 
tions does not fit the traditional binary 'right- 
wrong' pattern; it rather forms a gradual tran- 
sition between the two poles. 

In light of these facts, serious difficulties can be ex- 
pected arising from the structural component of the 
existing formalisms. Due to the frequency of dis- 
continuous constituents in non-configurational lan- 
guages, the filler-trace mechanism would be used 
very often, yielding syntactic trees fairly different 
from the underlying predicate-argument structures.

Consider the German sentence 

(1) daran wird ihn Anna erkennen,  daß  er weint
    at-it will him Anna recognise  that he cries

'Anna will recognise him at his cry'
A sample constituent structure is given below:

[Figure: constituent tree rooted in S; the traces are co-indexed with their fillers]

daran e1 wird ihn Anna e2 e3 erkennen, dass er weint

The fairly short sentence contains three non- 
local dependencies, marked by co-references between 
traces and the corresponding nodes. This hybrid 
representation makes the structure less transparent, 
and therefore more difficult to annotate. 

Apart from this rather technical problem, two fur- 
ther arguments speak against phrase structure as the 
structural pivot of the annotation scheme: 

• Phrase structure models stipulated for non- 
configurational languages differ strongly from 
each other, presenting a challenge to the in- 
tended theory-independence of the scheme. 

• Constituent structure serves as an explanatory 
device for word order variation, which is difficult 
to reconcile with the descriptivity requirement. 



Finally, the structural handling of free word or- 
der means stating well-formedness constraints on 
structures involving many trace-filler dependencies, 
which has proved tedious. Since most methods of 
handling discontinuous constituents make the for- 
malism more powerful, the efficiency of processing 
deteriorates, too. 

An alternative solution is to make argument struc- 
ture the main structural component of the formal- 
ism. This assumption underlies a growing number of 
recent syntactic theories which give up the context- 
free constituent backbone, cf. (McCawley, 1987), 
(Dowty, 1989), (Reape, 1993), (Kathol and Pollard, 
1995). These approaches provide an adequate ex- 
planation for several issues problematic for phrase- 
structure grammars (clause union, extraposition, di- 
verse second-position phenomena). 

2.4 Annotating Argument Structure 

Argument structure can be represented in terms of 
unordered trees (with crossing branches). In order to
reduce their ambiguity potential, rather simple, 'flat' 
trees should be employed, while more information 
can be expressed by a rich system of function labels. 

Furthermore, the required theory-independence 
means that the form of syntactic trees should not 
reflect theory-specific assumptions, e.g. every syn- 
tactic structure has a unique head. Thus, notions 
such as head should be distinguished at the level of 
syntactic functions rather than structures. This re- 
quirement speaks against the traditional sort of de- 
pendency trees, in which heads are represented as 
non-terminal nodes, cf. (Hudson, 1984). 

A tree meeting these requirements is given below: 




[Figure: unordered tree with crossing branches over the following terminals]

Adv   V    NP  NP   V         CPL  NP V
daran wird ihn Anna erkennen, dass er weint

Such a word order independent representation has 
the advantage of all structural information being en- 
coded in a single data structure. A uniform repre- 
sentation of local and non-local dependencies makes 
the structure more transparent¹.

3 The Annotation Scheme 

3.1 Architecture 

We distinguish the following levels of representation: 



¹A context-free constituent backbone can still be recovered
from the surface string and argument structure
by reattaching 'extracted' structures to a higher node.



Argument structure, represented in terms of unordered
trees.

Grammatical functions, encoded in edge labels, 
e.g. SB (subject), MO (modifier), HD (head). 

Syntactic categories, expressed by category la- 
bels assigned to non-terminal nodes and by 
part-of-speech tags assigned to terminals. 

3.2 Argument Structure 

A structure for (2) is shown in the corresponding figure.

(2) schade, daß  kein Arzt   anwesend ist, der
    pity    that no   doctor present  is   who
    sich auskennt
    is competent
    'Pity that no competent doctor is here'

Note that the root node does not have a head de- 
scendant (HD) as the sentence is a predicative con- 
struction consisting of a subject (SB) and a predi- 
cate (PD) without a copula. The subject is itself a 
sentence in which the copula (ist) does occur and is 
assigned the tag HD².

The tree resembles traditional constituent struc- 
tures. The difference is its word order independence: 
structural units ("phrases") need not be contigu- 
ous substrings. For instance, the extraposed relative 
clause (RC) is still treated as part of the subject NP. 

As the annotation scheme does not distinguish 
different bar levels or any similar intermediate cat- 
egories, only a small set of node labels is needed 
(currently 16 tags, S, NP, AP . . . ). 
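
The representation described in this section can be made concrete with a small data structure. The following Python sketch is only an illustration and not the storage format of the annotation tool: the class `Node`, its fields, the helper `leaf`, the part-of-speech tags and the functions inside the relative clause are all assumptions. It encodes sentence (2) and shows that the subject NP covers a non-contiguous set of word positions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    label: str                      # category (S, NP, ...) or part-of-speech tag
    word: Optional[str] = None      # filled for terminals only
    position: Optional[int] = None  # surface position, terminals only
    children: List[Tuple[str, "Node"]] = field(default_factory=list)

    def add(self, function: str, child: "Node") -> "Node":
        """Attach child under an edge labelled with a grammatical function."""
        self.children.append((function, child))
        return child

    def positions(self) -> List[int]:
        """Surface positions of all words dominated by this node."""
        if self.word is not None:
            return [self.position]
        covered: List[int] = []
        for _, child in self.children:
            covered.extend(child.positions())
        return sorted(covered)

    def is_discontinuous(self) -> bool:
        pos = self.positions()
        return bool(pos) and pos != list(range(pos[0], pos[-1] + 1))


def leaf(tag: str, word: str, position: int) -> Node:
    """Hypothetical helper that builds a terminal node."""
    return Node(tag, word=word, position=position)


# Sentence (2); tags and clause-internal functions are illustrative guesses.
root = Node("S")
root.add("PD", leaf("ADJD", "schade", 0))
subject_clause = root.add("SB", Node("S"))
subject_clause.add("CP", leaf("KOUS", "daß", 1))
subject_np = subject_clause.add("SB", Node("NP"))
subject_np.add("NK", leaf("PIAT", "kein", 2))
subject_np.add("NK", leaf("NN", "Arzt", 3))
subject_clause.add("PD", leaf("ADJD", "anwesend", 4))
subject_clause.add("HD", leaf("VAFIN", "ist", 5))
rc = subject_np.add("RC", Node("S"))
rc.add("SB", leaf("PRELS", "der", 6))
rc.add("OA", leaf("PRF", "sich", 7))
rc.add("HD", leaf("VVFIN", "auskennt", 8))

print(subject_np.positions())         # word positions covered by the subject NP
print(subject_np.is_discontinuous())  # the extraposed relative clause is inside the NP
```

Because word order is kept on the terminals only, the same tree serves for any permutation of the words; discontinuity is a property that is computed, not one that is encoded structurally.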

3.3 Grammatical Functions 

Due to the rudimentary character of the argument 
structure representations, a great deal of information 
has to be expressed by grammatical functions. Their 
further classification must reflect different kinds of 
linguistic information: morphology (e.g., case, inflection),
category, dependency type (complementation
vs. modification), thematic role, etc.³

However, there is a trade-off between the gran- 
ularity of information encoded in the labels and 
the speed and accuracy of annotation. In order to 
avoid inconsistencies, the corpus is annotated in two 
stages: basic annotation and refinement. While in 
the first phase each annotator has to annotate structures
as well as categories and functions, the refinement
can be done separately for each representation
level.

During the first phase, the focus is on annotat- 
ing correct structures and a coarse-grained classifi- 
cation of grammatical functions, which represent the 
following areas of information: 



²CP stands for complementizer, OA for accusative
object and RC for relative clause. NK denotes a 'kernel
NP' component (see section 4.1).

³For an extensive use of grammatical functions cf.
(Karlsson et al., 1995), (Voutilainen, 1994).



Dependency type: complements are further clas- 
sified according to features such as category and 
case: clausal complements (OC), accusative ob- 
jects (OA), datives (DA), etc. Modifiers are as- 
signed the label MO (further classification with 
respect to thematic roles is planned). Sepa- 
rate labels are defined for dependencies that 
do not fit the complement/modifier dichotomy, 
e.g., pre- (GL) and postnominal genitives (GR). 

Headedness versus non-headedness: 

Headed and non-headed structures are distin- 
guished by the presence or absence of a branch 
labeled HD. 

Morphological information: Another set of la- 
bels represents morphological information. PM 
stands for morphological particle, a label for 
German infinitival zu and superlative am. Sep- 
arable verb prefixes are labeled SVP. 

During the second annotation stage, the anno- 
tation is enriched with information about thematic 
roles, quantifier scope and anaphoric reference. As 
already mentioned, this is done separately for each 
of the three information areas. 

3.4 Structure Sharing 

A phrase or a lexical item can perform multiple func- 
tions in a sentence. Consider equi verbs where the 
subject of the infinitival VP is not realised syntac- 
tically, but co-referent with the subject or object of 
the matrix equi verb: 

(3) er bat mich zu kommen 
he asked me to come 

(mich is the understood subject of kommen). In such
cases, an additional edge is drawn from the embed- 
ded VP node to the controller, thus changing the 
syntactic tree into a graph. We call such additional 
edges secondary links and represent them as dotted 
lines, see the corresponding figure, showing the structure of (3).
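
A minimal Python sketch of this idea, under stated assumptions, keeps primary and secondary edges in separate lists so that the primary edges alone still form a tree. The class `AnnotationGraph`, the node names, the label OC on the embedded VP and the label SB on the secondary link are illustrative choices, not taken from the tool.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Edge = Tuple[str, str, str]  # (parent, function label, child)


@dataclass
class AnnotationGraph:
    primary: List[Edge] = field(default_factory=list)    # the unordered tree
    secondary: List[Edge] = field(default_factory=list)  # structure sharing

    def parents(self, node: str, with_secondary: bool = False) -> List[Tuple[str, str]]:
        """Return (parent, function) pairs for a node."""
        edges = self.primary + (self.secondary if with_secondary else [])
        return [(p, f) for p, f, c in edges if c == node]


# Sentence (3): er bat mich zu kommen
g = AnnotationGraph()
g.primary += [
    ("S", "SB", "er"),
    ("S", "HD", "bat"),
    ("S", "OA", "mich"),
    ("S", "OC", "VP"),
    ("VP", "PM", "zu"),
    ("VP", "HD", "kommen"),
]
# Secondary link from the embedded VP to its controller (label SB assumed).
g.secondary.append(("VP", "SB", "mich"))

print(g.parents("mich"))                       # primary tree only
print(g.parents("mich", with_secondary=True))  # tree plus secondary link
```

Keeping the two edge sets apart means that tree-based processing can ignore structure sharing entirely, while graph-based queries can opt in.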

4 Treatment of Selected Phenomena 

As theory-independence is one of our objectives, 
the annotation scheme incorporates a number of 
widely accepted linguistic analyses, especially in 
the area of verbal, adverbial and adjectival syn- 
tax. However, some other standard analyses turn 
out to be problematic, mainly due to the partial, 
idealised character of competence grammars, which 
often marginalise or ignore such important phenom- 
ena as 'deficient' (e.g. headless) constructions, ap- 
positions, temporal expressions, etc. 

In the following paragraphs, we give annotations 
for a number of such phenomena. 

4.1 Noun Phrases 

Most linguistic theories treat NPs as structures 
headed by a unique lexical item (noun). However,
this idealised model needs several additional
assumptions in order to account for such important
phenomena as complex nominal NP components (cf.
(4)) or nominalised adjectives (cf. (5)).

(4) my uncle Peter Smith 

(5) der sehr Glückliche
the very happy 

'the very happy one' 

In (4), different theories make different headedness
predictions. In (5), either a lexical nominalisation
rule for the adjective Glückliche is stipulated, or the
existence of an empty nominal head. Moreover, the 
so-called DP analysis views the article der as the 
head of the phrase. Further differences concern the 
attachment of the degree modifier sehr. 

Because of the intended theory-independence of 
the scheme, we annotate only the common mini- 
mum. We distinguish an NP kernel consisting of 
determiners, adjective phrases and nouns. All com- 
ponents of this kernel are assigned the label NK and 
treated as sibling nodes. 

The difference between the particular NK's lies in 
the positional and part-of-speech information, which 
is also sufficient to recover theory-specific structures
from our 'underspecified' representations. For in- 
stance, the first determiner among the NK's can be 
treated as the specifier of the phrase. The head of 
the phrase can be determined in a similar way ac- 
cording to theory-specific assumptions. 
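
How such a recovery step might look is sketched below in Python. Everything in it is an assumption made for illustration: the type `Component`, the tag sets, and in particular the head rule (rightmost noun, otherwise rightmost adjectival component), which stands for just one of the theory-specific choices mentioned above.

```python
from typing import List, NamedTuple, Optional


class Component(NamedTuple):
    position: int  # surface position of the component
    tag: str       # part-of-speech tag or phrase category
    form: str


DETERMINER_TAGS = {"ART", "PIAT", "PPOSAT"}  # assumed tag names
NOUN_TAGS = {"NN", "NE"}                     # assumed tag names
ADJECTIVE_TAGS = {"ADJA", "AP"}              # assumed tag names


def specifier(kernel: List[Component]) -> Optional[Component]:
    """First determiner among the NK components, if any."""
    determiners = [c for c in kernel if c.tag in DETERMINER_TAGS]
    return min(determiners, key=lambda c: c.position) if determiners else None


def head(kernel: List[Component]) -> Optional[Component]:
    """One possible head rule: rightmost noun, else rightmost adjective."""
    for tags in (NOUN_TAGS, ADJECTIVE_TAGS):
        candidates = [c for c in kernel if c.tag in tags]
        if candidates:
            return max(candidates, key=lambda c: c.position)
    return None


# Kernel of (5), "der sehr Glückliche", with the adjective phrase as one NK.
kernel = [Component(1, "AP", "sehr Glückliche"), Component(0, "ART", "der")]
print(specifier(kernel).form)
print(head(kernel).form)
```

A DP-style analysis would replace `head` by a function returning the determiner; the annotation itself stays unchanged.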

In addition, a number of clear-cut NP components 
can be defined outside that juxtapositional kernel: 
pre- and postnominal genitives (GL, GR), relative 
clauses (RC), clausal and sentential complements 
(OC). They are all treated as siblings of NK's re- 
gardless of their position (in situ or extraposed). 

4.2 Attachment Ambiguities 

Adjunct attachment often gives rise to structural 
ambiguities or structural uncertainty. However, full 
or partial disambiguation takes place in context, and 
the annotators do not consider unrealistic readings. 

In addition, we have adopted a simple convention 
for those cases in which context information is insuf- 
ficient for total disambiguation: the highest possible 
attachment site is chosen. 

A similar convention has been adopted for con- 
structions in which scope ambiguities have syntac- 
tic effects but a one-to-one correspondence between 
scope and attachment does not seem reasonable, cf. 
focus particles such as only or also. If the scope of 
such a word does not directly correspond to a tree 
node, the word is attached to the lowest node dominating
all subconstituents appearing in its scope.
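
The second convention amounts to finding the lowest common dominating node of the elements in scope. The Python sketch below illustrates this on an invented toy tree given as a child-to-parent map; the function names and the tree are assumptions, not part of the annotation tool.

```python
from typing import Dict, Iterable, List, Optional


def ancestors(node: str, parent: Dict[str, str]) -> List[str]:
    """Nodes dominating `node`, ordered from the lowest up to the root."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def attachment_site(scope: Iterable[str], parent: Dict[str, str]) -> Optional[str]:
    """Lowest node that dominates every element in the scope."""
    scope = list(scope)
    if not scope:
        return None
    candidates = ancestors(scope[0], parent)
    for element in scope[1:]:
        dominating = set(ancestors(element, parent))
        # Filtering keeps the bottom-up order of the first chain.
        candidates = [n for n in candidates if n in dominating]
    return candidates[0] if candidates else None


# Toy tree: S dominates NP and VP; VP dominates PP and the verb.
parent = {"NP": "S", "VP": "S", "PP": "VP", "verb": "VP",
          "noun": "NP", "prep": "PP", "pnoun": "PP"}
print(attachment_site(["prep", "pnoun"], parent))
print(attachment_site(["pnoun", "verb"], parent))
print(attachment_site(["noun", "verb"], parent))
```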

4.3 Coordination 

A problem for the rudimentary argument structure 
representations is the use of incomplete structures
in natural language, i.e. phenomena such as coordination
and ellipsis. Since a precise structural description
of non-constituent coordination would require
a rich inventory of incomplete phrase types, we
have agreed on a sort of underspecified representation:
the coordinated units are assigned structures
in which missing lexical material is not represented
at the level of primary links. The corresponding figure
shows the representation of the sentence:

(6) sie wurde von preußischen Truppen besetzt
    she was   by  Prussian    troops  occupied
    und 1887 dem    preußischen Staat angegliedert
    and 1887 to-the Prussian    state incorporated

'it was occupied by Prussian troops and incorporated
into Prussia in 1887'

The category of the coordination is labeled CVP 
here, where C stands for coordination, and VP for 
the actual category. This extra marking makes it 
easy to distinguish between 'normal' and coordi- 
nated categories. 

Multiple coordination as well as enumerations are 
annotated in the same way. An explicit coordinating 
conjunction need not be present. 

Structure-sharing is expressed using secondary 
links. 

5 The Annotation Tool 

5.1 Requirements 

The development of linguistically interpreted cor- 
pora presents a laborious and time-consuming task. 
In order to make the annotation process more effi- 
cient, extra effort has been put into the development 
of an annotation tool. 

The tool supports immediate graphical feedback 
and automatic error checking. Since our scheme per- 
mits crossing edges, visualisation as bracketing and 
indentation would be insufficient. Instead, the com- 
plete structure should be represented. 

The tool should also permit a convenient han- 
dling of node and edge labels. In particular, variable 
tagsets and label collections should be allowed. 

5.2 Implementation 

As the need for certain functionalities becomes ob- 
vious with growing annotation experience, we have 
decided to implement the tool in two stages. In the 
first phase, the main functionality for building and 
displaying unordered trees is supplied. In the sec- 
ond phase, secondary links and additional structural 
functions are supported. The implementation of the 
first phase as described in the following paragraphs 
is completed. 

As keyboard input is more efficient than mouse
input (cf. (Lehmann et al., 1996)), most effort has
been put into developing an efficient keyboard interface.
Menus are supported as a useful way of getting
help on commands and labels. In addition to pure
annotation, we can attach comments to structures.
A screen dump of the tool is shown in the accompanying figure. The
largest part of the window contains the graphical
representation of the structure being annotated.
The following commands are available: 

• group words and/or phrases to a new phrase; 

• ungroup a phrase; 

• change the name of a phrase or an edge; 

• re-attach a node;

• generate the PostScript output of a sentence.

The three tagsets used by the annotation tool 
(for words, phrases, and edges) are variable and are 
stored together with the corpus. This allows easy 
modification if needed. The tool checks the appro- 
priateness of the input. 

For the implementation, we used Tcl/Tk Version 
4.1. The corpus is stored in a SQL database. 

5.3 Automation 

The degree of automation increases with the amount 
of data available. Sentences annotated in previous 
steps are used as training material for further pro- 
cessing. We distinguish five degrees of automation: 

0) Completely manual annotation. 

1) The user determines phrase boundaries and 
syntactic categories (S, NP, etc.). The program 
automatically assigns grammatical function la- 
bels. The annotator can alter the assigned tags. 

2) The user only determines the components of a 
new phrase, the program determines its syntac- 
tic category and the grammatical functions of 
its elements. Again, the annotator has the op- 
tion of altering the assigned tags. 

3) Additionally, the program performs simple 
bracketing, i.e., finds 'kernel' phrases. 

4) The tagger suggests partial or complete parses. 

So far, about 1100 sentences of our corpus have 
been annotated. This amount of data suffices as 
training material to reliably assign the grammatical 
functions if the user determines the elements of a 
phrase and its type (step 1 of the list above).

5.4 Assigning Grammatical Function Labels

Grammatical functions are assigned using standard 
statistical part-of-speech tagging methods (cf. e.g. 
(Cutting et al., 1992) and (Feldweg, 1995)). 

For a phrase $Q$ with children of type $T_1, \ldots, T_k$
and grammatical functions $G_1, \ldots, G_k$, we use the
lexical probabilities

$$P_Q(G_i \mid T_i)$$

and the contextual (trigram) probabilities

$$P_Q(T_i \mid T_{i-1}, T_{i-2}).$$

The lexical and contextual probabilities are determined
separately for each type of phrase. During
annotation, the highest rated grammatical function
labels $G_i$ are calculated using the Viterbi algorithm
and assigned to the structure, i.e., we calculate

$$\operatorname*{argmax}_{G} \prod_{i=1}^{k} P_Q(T_i \mid T_{i-1}, T_{i-2}) \cdot P_Q(G_i \mid T_i).$$
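
The following Python sketch shows how a trigram Viterbi search over function labels could be organised for a single phrase. It is not the implementation used in the tool. It assumes the arrangement usual in HMM tagging, in which the trigram model ranges over the function labels and the lexical model links a label to the category of the child; how this maps onto the notation above is an assumption. The function name, the flooring of unseen events and all probability values are invented for illustration.

```python
from math import log
from typing import Dict, List, Tuple

START = "<s>"  # padding symbol for the first two context positions


def viterbi_functions(child_types: List[str],
                      labels: List[str],
                      lexical: Dict[Tuple[str, str], float],
                      contextual: Dict[Tuple[str, str, str], float],
                      floor: float = 1e-6) -> List[str]:
    """Most probable function-label sequence for the children of one phrase.

    lexical[(g, t)]          probability of label g for a child of category t
    contextual[(g2, g1, g)]  probability of label g after labels g2, g1
    Unseen events receive the small probability `floor`.
    """
    # A search state is the pair of the two most recent labels.
    best = {(START, START): (0.0, [])}
    for t in child_types:
        following = {}
        for (g2, g1), (score, path) in best.items():
            for g in labels:
                s = (score
                     + log(contextual.get((g2, g1, g), floor))
                     + log(lexical.get((g, t), floor)))
                state = (g1, g)
                if state not in following or s > following[state][0]:
                    following[state] = (s, path + [g])
        best = following
    return max(best.values(), key=lambda entry: entry[0])[1]


# Toy model for a phrase of type S (all numbers are made up).
lexical = {("SB", "NP"): 0.5, ("OA", "NP"): 0.4, ("HD", "VVFIN"): 0.9}
contextual = {
    (START, START, "SB"): 0.6, (START, START, "OA"): 0.2,
    (START, "SB", "HD"): 0.8, (START, "OA", "HD"): 0.8,
    ("SB", "HD", "OA"): 0.7, ("SB", "HD", "SB"): 0.05,
    ("OA", "HD", "SB"): 0.6, ("OA", "HD", "OA"): 0.05,
}
best_labels = viterbi_functions(["NP", "VVFIN", "NP"], ["SB", "HD", "OA"],
                                lexical, contextual)
print(best_labels)
```

Since the probabilities are estimated separately for each phrase type, one such pair of tables would be needed per category (S, NP, VP, and so on).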



To keep the human annotator from missing errors 
made by the tagger, we additionally calculate the 
strongest competitor for each label $G_i$. If its probability
is close to that of the winner (closeness is defined by
a threshold on the quotient), the assignment is re- 
garded as unreliable, and the annotator is asked to 
confirm the assignment. 
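
The reliability test can be sketched in a few lines of Python. The function name `assess`, the input format (one score per candidate label) and the threshold value 0.5 are assumptions; the text above only states that a threshold on the quotient is used.

```python
from typing import Dict, Tuple


def assess(scores: Dict[str, float], threshold: float = 0.5) -> Tuple[str, bool]:
    """Return the winning label and whether the assignment counts as reliable.

    The assignment is unreliable when the quotient of the runner-up's
    probability and the winner's probability reaches the threshold.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    winner, p_winner = ranked[0]
    if len(ranked) == 1 or p_winner == 0.0:
        # No competitor to compare with, or no evidence at all.
        return winner, len(ranked) == 1
    p_competitor = ranked[1][1]
    return winner, (p_competitor / p_winner) < threshold


print(assess({"SB": 0.80, "OA": 0.10}))  # clear winner
print(assess({"SB": 0.45, "OA": 0.40}))  # close competitor, ask the annotator
```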

For evaluation, the already annotated sentences 
were divided into two disjoint sets, one for training 
(90% of the corpus), the other one for testing (10%). 
The procedure was repeated 10 times with different 
partitionings. 
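
One way to obtain ten different 90/10 partitionings is sketched below in Python. The text does not say how the partitionings were drawn, so the rotation scheme, the function name and the fixed random seed are assumptions.

```python
import random
from typing import List, Tuple


def partitionings(sentence_ids: List[int], rounds: int = 10,
                  seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Build `rounds` train/test splits with pairwise disjoint test sets."""
    ids = list(sentence_ids)
    random.Random(seed).shuffle(ids)
    splits = []
    for r in range(rounds):
        test = ids[r::rounds]  # every rounds-th id, starting at offset r
        held_out = set(test)
        train = [i for i in ids if i not in held_out]
        splits.append((train, test))
    return splits


splits = partitionings(list(range(1100)))
train, test = splits[0]
print(len(train), len(test))
```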

The tagger rates 90% of all assignments as reliable 
and carries them out fully automatically. Accuracy 
for these cases is 97%. Most errors are due to wrong 
identification of the subject and different kinds of 
objects in sentences and VPs. Accuracy of the unre- 
liable 10% of assignments is 75%, i.e., the annotator 
has to alter the choice in 1 of 4 cases when asked for 
confirmation. Overall accuracy of the tagger is 95%. 
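
The overall figure follows from the two partial accuracies, weighted by their share of the assignments:

$$0.90 \times 0.97 + 0.10 \times 0.75 = 0.873 + 0.075 = 0.948 \approx 95\%.$$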

Owing to the partial automation, the average an- 
notation efficiency improves by 25% (from around 4 
minutes to 3 minutes per sentence). 

6 Conclusion 

As the annotation scheme described in this paper fo- 
cusses on annotating argument structure rather than 
constituent trees, it differs from existing treebanks in 
several aspects. These differences can be illustrated 
by a comparison with the Penn Treebank annotation 
scheme. The following features of our formalism are 
then of particular importance: 

• simpler (i.e. 'flat') representation structures 

• complete absence of empty categories 

• no special mechanisms for handling discontinu- 
ous constituency 

The current tagset comprises only 16 node labels 
and 34 function tags, yet a finely grained classifica- 
tion will take place in the near future. 

We have argued that the selected approach is bet- 
ter suited for producing high quality interpreted cor- 
pora in languages exhibiting free constituent order. 
In general, the resulting interpreted data also are 
closer to semantic annotation and more neutral with 
respect to particular syntactic theories. 

[Figure: screen dump of the annotation tool]

As modern linguistics is also becoming more aware
of the importance of larger sets of naturally occurring
data, interpreted corpora are a valuable resource
for theoretical and descriptive linguistic research.
In addition, the approach provides empirical
material for psycholinguistic investigation, since
preferences for the choice of certain syntactic constructions,
linearizations, and attachments that have
been observed in online experiments of language production
and comprehension can now be put in relation
with the frequency of these alternatives in larger
amounts of texts.

Syntactically annotated corpora of German have 
been missing until now. In the second phase of the
Verbmobil project, a treebank of 30,000 spoken German
sentences, and of the same number of English and
Japanese sentences, will be created. We
will closely coordinate the further development of 
our corpus with the annotation work in Verbmobil 
and with other German efforts in corpus annotation. 

Since the combinatorics of syntactic constructions 
creates a demand for very large corpora, efficiency of
annotation is an important criterion for the success
of the developed methodology and tools. Our anno- 
tation tool supplies efficient manipulation and im- 
mediate visualization of argument structures. Par- 
tial automation included in the current version sig- 
nificantly reduces the manual effort. Its extension is 
subject to further investigations. 